summary(data)
## radius texture perimeter area
## Min. : 6.981 Min. : 9.71 Min. : 43.79 Min. : 143.5
## 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17 1st Qu.: 420.3
## Median :13.370 Median :18.84 Median : 86.24 Median : 551.1
## Mean :14.127 Mean :19.29 Mean : 91.97 Mean : 654.9
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10 3rd Qu.: 782.7
## Max. :28.110 Max. :39.28 Max. :188.50 Max. :2501.0
## smoothness compactness concavity concave.points
## Min. :0.05263 Min. :0.01938 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956 1st Qu.:0.02031
## Median :0.09587 Median :0.09263 Median :0.06154 Median :0.03350
## Mean :0.09636 Mean :0.10434 Mean :0.08880 Mean :0.04892
## 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070 3rd Qu.:0.07400
## Max. :0.16340 Max. :0.34540 Max. :0.42680 Max. :0.20120
## symmetry fractal.dimension
## Min. :0.1060 Min. :0.04996
## 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.1792 Median :0.06154
## Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.3040 Max. :0.09744
When we check the descriptive statistics of the dataset:
The mean of the radius variable is higher than the median, which suggests that radius is right-skewed. Comparing the third quartile with the maximum value, it can be inferred that there might be some outliers in the variable. However, it is best to make the final judgement with boxplot analysis.
The mean of the texture variable is higher than the median, which suggests that texture is right-skewed. Comparing the third quartile with the maximum value, it can be inferred that there might be some outliers in the variable. However, it is best to make the final judgement with boxplot analysis.
The mean of the perimeter variable is higher than the median, which suggests that perimeter is right-skewed. Comparing both the third quartile with the maximum value and the minimum with the first quartile, it can be inferred that there might be some outliers in the variable. However, it is best to make the final judgement with boxplot analysis.
From the descriptive statistics, it is clear that the variables are on very different scales (for example, area reaches 2501 while fractal.dimension stays below 0.1), so the dataset must be scaled. In addition, if there is high correlation between variable pairs, PCA needs to be applied to the dataset.
library(corrplot)  # provides corrplot()
corr <- cor(data, method = "spearman")
corrplot(corr, method = "color")
Correlation is analyzed with the Spearman method, because the Pearson method assumes normality and the variables here appear to be skewed.
The correlation matrix of the dataset reveals high correlations between variable pairs. The correlations among radius, perimeter, and area are above 0.98, which is very high. The correlations among compactness, concavity, and concave points are above 0.83. Fractal dimension is the only variable with negative correlations.
par(mfrow=c(2,5))
boxplot(data$radius, main = "radius", col = "dodgerblue2")
boxplot(data$texture, main = "texture", col = "dodgerblue2")
boxplot(data$perimeter, main = "perimeter", col = "dodgerblue2")
boxplot(data$area, main = "area", col = "dodgerblue2")
boxplot(data$smoothness, main = "smoothness", col = "dodgerblue2")
boxplot(data$compactness, main = "compactness", col = "dodgerblue2")
boxplot(data$concavity, main = "concavity", col = "dodgerblue2")
boxplot(data$concave.points, main = "concave points", col = "dodgerblue2")
boxplot(data$symmetry, main = "symmetry", col = "dodgerblue2")
boxplot(data$fractal.dimension , main = "fractal dimension", col = "dodgerblue2")
The boxplots show that every variable has many outliers. The variance of some variables (radius, texture, perimeter, concavity, concave points) is very high.
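library(dplyr)    # provides %>%
library(tidyr)    # provides gather()
library(ggplot2)  # provides ggplot() and the geoms/themes below
# Keep the numeric columns plus the Diagnosis label
indexes <- sapply(df2, is.numeric)
indexes["Diagnosis"] <- TRUE
# Reshape to long format and draw one boxplot panel per variable
df2[, indexes] %>%
  gather(-Diagnosis, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = Diagnosis, color = Diagnosis)) +
  geom_boxplot() +
  facet_wrap(~ var, scales = "free") +
  theme(axis.text.x = element_text(angle = 30, hjust = 0.85), legend.position = "none",
        panel.background = element_rect(fill = "white")) +
  theme(strip.background = element_rect(fill = "goldenrod1")) +
  theme(strip.text = element_text(colour = "firebrick3"))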
When the boxplots of the variables by the levels of Diagnosis are examined, the M level takes higher values for almost every variable; the only exception is fractal.dimension. Likewise, the variance of the M level is higher for all variables except fractal.dimension. These differences suggest that the dataset is clusterable.
Principal component analysis (PCA) is a technique used to identify patterns in a dataset. It does this by identifying the directions (or “components”) in the data that account for the most variation. The first component is the direction in the data that accounts for the most variation, the second component is the direction in the data that accounts for the second most variation, and so on. [1], [2]
Here is a step-by-step explanation of how PCA is calculated:
Standardize the data: The data is transformed so that each variable has a mean of zero and a standard deviation of one. This is done to ensure that all variables are on the same scale.
Compute the covariance matrix: This matrix is calculated to determine the relationship between the variables in the dataset.
Compute the eigenvectors and eigenvalues of the covariance matrix: Eigenvectors represent the directions in the data that account for the most variation, and eigenvalues represent the amount of variation that is accounted for by each eigenvector.
Select the principal components: The eigenvectors with the highest eigenvalues are chosen as the principal components of the dataset.
Transform the data: The original dataset is transformed by projecting it onto the principal components, resulting in a new dataset with reduced dimensionality.
Interpret the components: The principal components are interpreted in terms of the original variables to understand the underlying patterns in the data.
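The steps above can be sketched in base R. This is only an illustration under stated assumptions: the built-in iris measurements stand in for the report's dataset, and keeping two components is an arbitrary choice for the example.

```r
# Stand-in data (assumption): numeric columns of the built-in iris dataset.
X <- as.matrix(iris[, 1:4])

# 1. Standardize: mean 0 and standard deviation 1 for every variable.
Z <- scale(X, center = TRUE, scale = TRUE)

# 2. Covariance matrix of the standardized data
#    (this equals the correlation matrix of X).
S <- cov(Z)

# 3. Eigenvectors give the directions, eigenvalues the variance along them.
eig <- eigen(S)

# 4. Keep the components with the largest eigenvalues (two here).
k <- 2
W <- eig$vectors[, 1:k]

# 5. Project the standardized data onto the selected components.
scores <- Z %*% W

# 6. Inspect the loadings to interpret components via the original variables.
rownames(W) <- colnames(X)
colnames(W) <- paste0("PC", 1:k)
print(round(W, 3))

# Cross-check against prcomp(); component signs may be flipped.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
print(round(pca$rotation[, 1:k], 3))
```

The eigenvalues from step 3 equal the squared standard deviations reported by prcomp(), which is why both routes lead to the same proportions of variance.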
Applying PCA to the dataset before clustering has several advantages, which can be listed as follows:
Dimensionality Reduction: PCA can be used to reduce the number of features in a high-dimensional dataset, which can help improve the performance of a clustering algorithm. By reducing the dimensionality, PCA can also help to reduce noise and eliminate multicollinearity in the data, making it easier to interpret the results of a clustering analysis. [3]
Visualization: PCA can be used to visualize high-dimensional datasets in two or three dimensions, making it easier to understand the structure of the data and identify clusters. This can be especially useful for large datasets with many features, as it is often difficult to visualize and interpret the results of a clustering analysis in high dimensions.
Speed: Clustering algorithms can be computationally expensive, especially for large datasets. By reducing the dimensionality of the data with PCA, the computational burden of the clustering algorithm can be significantly reduced, making it faster and more computationally efficient. [3]
Improved Clustering Results: PCA can help to enhance the performance of a clustering algorithm by transforming the data into a new coordinate system that better separates the underlying clusters. This can lead to more accurate and meaningful results, especially for datasets with complex structures. [4]
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3406 1.5870 0.93841 0.7064 0.61036 0.35234 0.28299
## Proportion of Variance 0.5479 0.2519 0.08806 0.0499 0.03725 0.01241 0.00801
## Cumulative Proportion 0.5479 0.7997 0.88779 0.9377 0.97495 0.98736 0.99537
## PC8 PC9 PC10
## Standard deviation 0.18679 0.10552 0.01680
## Proportion of Variance 0.00349 0.00111 0.00003
## Cumulative Proportion 0.99886 0.99997 1.00000
Two components seem to be the best choice to explain the dataset. The cumulative proportion rises from 0.7997 at PC2 to 0.8878 at PC3; this increase is small enough to be dismissed, since the proportion of variance explained by PC3 alone (0.088) is low. This suggests that two is the best number of principal components.
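The proportions in the table follow directly from the component standard deviations. The short sketch below recomputes them from the values printed above; only the variable names are introduced for illustration.

```r
# Standard deviations copied from the "Importance of components" output.
sdev <- c(2.3406, 1.5870, 0.93841, 0.7064, 0.61036,
          0.35234, 0.28299, 0.18679, 0.10552, 0.01680)

variance   <- sdev^2                    # variance carried by each component
proportion <- variance / sum(variance)  # share of the total variance
cumulative <- cumsum(proportion)        # running total across components

print(round(rbind(proportion, cumulative), 4))

# Gain in explained variance from adding the third component.
gain_pc3 <- cumulative[3] - cumulative[2]
print(round(gain_pc3, 3))
```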
## [1] 5.3566238349 2.5608177644 0.8603864653 0.5572101619 0.3467825159
## [6] 0.2052861825 0.0711745077 0.0405225792 0.0009114765 0.0002845116
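The eigenvalues show that only the first two components exceed 1; the third eigenvalue (0.86) is the closest to 1, so by this criterion up to 3 components could be retained.
According to the elbow of the scree plot, 3 components is the best choice.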
## PC1 PC2 PC3
## radius -0.36393793 0.313929073 -0.12442759
## texture -0.15445113 0.147180909 0.95105659
## perimeter -0.37604434 0.284657885 -0.11408360
## area -0.36408585 0.304841714 -0.12337786
## smoothness -0.23248053 -0.401962324 -0.16653247
## compactness -0.36444206 -0.266013147 0.05827786
## concavity -0.39574849 -0.104285968 0.04114649
## concave.points -0.41803840 -0.007183605 -0.06855383
## symmetry -0.21523797 -0.368300910 0.03672364
## fractal.dimension -0.07183744 -0.571767700 0.11358395
When the loadings of the variables on the components were analyzed, it became clear that the third component expresses only the Texture variable. The anomaly of the Texture variable was also visible in the correlation analysis. It is understandable that this variable, which has no significant correlation with any other variable, is expressed by a separate component. If the number of components is chosen to be two, the two components do not express the Texture variable well. However, adding a component for a single variable did not seem worthwhile for an explanatory gain of about 0.09. The low correlation of the Texture variable with the other variables was also taken into account in reaching this decision.
x <- fa.parallel(data, fm="pa", fa="both", n.iter=1)
## Parallel analysis suggests that the number of factors = 2 and the number of components = 2
The fa.parallel function in the psych package determines the number of components automatically, and it suggests two components.
As can be seen from the diagram, the variables are explained by the components as follows.
PC1 : Radius, Perimeter, Area, Concave Points, Concavity, Texture
PC2 : Fractal Dimension, Smoothness, Compactness, Symmetry
When the contributions of the observations in the PC1 and PC2 graphs are analyzed, a clustering is observed in the upper right and lower right. It can be said that these observations express similar characteristics. For example, when the values of the 79th observation at the bottom left are examined, it can be seen that it has values close to the maximum for all variables except Radius and Texture. When the 569th observation values on the opposite axis are examined, it can be seen that Smoothness and Concavity have minimum values, while Texture has a value above the 3rd quartile.
When the PCA graph of the variables is analyzed, it can be said that the variables with positive correlation point to the same regions. While Area has a positive correlation with Perimeter, which is in the same component, it has a negative correlation with fractal dimension, which is in a different component. The contributions of the variables can be better seen through this graph.
For cluster analysis, the PCA-transformed dataset will be used.
The Hopkins statistic is a measure used to determine the likelihood that a dataset is generated from a uniform distribution, which is useful for determining whether a dataset is suitable for clustering. [5], [6]
The Hopkins statistic is calculated as follows:
Generating a random sample of n points from the dataset, where n is a small number (typically n=50).
Generating a random sample of n points from a uniform distribution, with the same number of dimensions as the dataset.
Calculating the average distance between each point in the dataset sample and its nearest neighbor in the dataset (d(data)).
Calculating the average distance between each point in the uniform sample and its nearest neighbor in the dataset (d(unif)).
Calculating the Hopkins statistic as H = d(unif) / (d(unif) + d(data)).
Under this definition, a value of the Hopkins statistic close to 1 indicates that the dataset is suitable for clustering, while a value around 0.5 indicates that the dataset is not suitable for clustering and might have been generated from a uniform distribution. Some implementations report the complement, d(data) / (d(data) + d(unif)), in which case values close to 0 indicate a clusterable dataset.
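The calculation can be sketched in base R as follows. The function name `hopkins_sketch`, its arguments, and the iris stand-in data are assumptions made for illustration; this is not the implementation used in the report. It computes H as sum(u) / (sum(u) + sum(w)), so values near 1 suggest clustering, whereas packages that report the complement give values near 0 for the same data.

```r
# Hypothetical helper: a minimal Hopkins statistic in base R.
hopkins_sketch <- function(X, m = 50, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  d <- ncol(X)

  # Distance from point p to its nearest neighbour among the rows of Y.
  nn_dist <- function(p, Y) min(sqrt(rowSums(sweep(Y, 2, p)^2)))

  # m real points sampled from the data (m must be at least 2 here).
  idx <- sample(nrow(X), m)
  # w: nearest-neighbour distance of each sampled point within the data,
  # excluding the point itself.
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))

  # m artificial points drawn uniformly inside the bounding box of the data.
  mins <- apply(X, 2, min)
  maxs <- apply(X, 2, max)
  U <- sapply(seq_len(d), function(j) runif(m, mins[j], maxs[j]))
  # u: distance from each artificial point to its nearest real point.
  u <- apply(U, 1, nn_dist, Y = X)

  sum(u) / (sum(u) + sum(w))
}

# Usage on stand-in data (assumption): scaled iris measurements.
print(hopkins_sketch(scale(iris[, 1:4]), m = 30))
```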
hopkins.data <- hopkins(pcadata, n = nrow(pcadata)-1)
hopkins.data
## $H
## [1] 0.1976088
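The Hopkins value for this dataset is approximately 0.198 (the exact value varies slightly between runs because the calculation relies on random sampling). The function used here appears to report the statistic under the convention in which values well below 0.5 indicate clustering, so this value indicates that the dataset is clusterable.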
K-means is a popular clustering algorithm that groups similar observations together (clusters) based on a set of features. The main idea behind k-means is to define spherical clusters where the observations in the same cluster are as similar as possible and observations in different clusters are as dissimilar as possible. [7], [8], [9]
The steps to perform k-means clustering are:
1. Select k, the number of clusters that you want to form in the data.
2. Select k random points from the dataset as the initial centroids (cluster centers).
3. Assign each observation to the cluster whose centroid is closest to it.
4. Recalculate the centroids as the mean of all the observations in each cluster.
5. Repeat steps 3 and 4 until the cluster assignments no longer change or a maximum number of iterations is reached.
It’s important to note that the final clusters may depend on the initial conditions, so it’s recommended to run k-means multiple times with different initial centroids, then choose the best solution. Also k-means is sensitive to the scale of the data, so it’s recommended to scale the data before applying the k-means algorithm. K-means is efficient for large datasets, but it’s not well suited for non-globular clusters or clusters of different densities. After applying the k-means algorithm, the resulting output will be k clusters where each cluster has its own centroid, and each observation will be assigned to the cluster to which it is closest. These clusters can be used for further analysis or interpretation of the data.
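As a minimal sketch of these recommendations, the example below scales the data and runs k-means from several random starts with base R's kmeans(). The iris measurements are a stand-in for the report's data, and k = 3 and nstart = 25 are arbitrary choices for illustration.

```r
# Stand-in data (assumption): scaled numeric columns of iris.
X <- scale(iris[, 1:4])

set.seed(123)
# nstart = 25 repeats the algorithm from 25 random initialisations
# and keeps the solution with the lowest total within-cluster SS.
km <- kmeans(X, centers = 3, nstart = 25, iter.max = 100)

print(km$size)      # number of observations per cluster
print(km$withinss)  # within-cluster sum of squares per cluster

# Share of the total sum of squares explained by the partition,
# i.e. the "between_SS / total_SS" figure in the printed output.
explained <- km$betweenss / km$totss
print(round(100 * explained, 1))
```

The total sum of squares splits exactly into the between-cluster and within-cluster parts, so a higher share means tighter clusters, although it always increases as k grows.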
The elbow method is a technique used to determine the optimal number of clusters for a k-means clustering analysis. The idea behind the elbow method is to run k-means clustering on the dataset for a range of values of k (number of clusters), and for each value of k calculate the sum of squared distances of each point from its closest centroid (SSE). The elbow point is the point on the plot of SSE against the number of clusters (k) where the change in SSE begins to level off, indicating that adding more clusters doesn’t improve the model much. [10], [11]
The steps to perform the elbow method are:
Select a range of k values, usually from 1 to 10 or the square root of the number of observations in the dataset.
Run k-means clustering for each k value and calculate the SSE (sum of squared distances of each point from its closest centroid).
Plot the SSE for each k value.
The point on the plot where the SSE starts to decrease at a slower rate is the elbow point, and the corresponding number of clusters is the optimal value for k.
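The elbow procedure can be sketched with base R alone. The iris data and the range 1 to 10 are assumptions for illustration; the report's own plot may have been produced with a different helper.

```r
# Stand-in data (assumption): scaled numeric columns of iris.
X <- scale(iris[, 1:4])

set.seed(123)
k_values <- 1:10

# Total within-cluster sum of squares (SSE) for each k.
wss <- sapply(k_values, function(k) {
  kmeans(X, centers = k, nstart = 25)$tot.withinss
})

# Plot SSE against k and look for the point where the curve flattens.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```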
When the Elbow Method graph is analyzed, it can be said that it is not possible to make a definite decision for the number of clusters, but two clusters can be selected.
The average silhouette method is a technique used to determine the optimal number of clusters for a clustering analysis. It measures the similarity of each point to its own cluster compared to other clusters. The silhouette value of a point is a measure of how similar that point is to other points in its own cluster compared to other clusters.[12], [13]
The steps to perform the average silhouette method are:
Select a range of k values, usually from 2 to 10 or up to the square root of the number of observations in the dataset (the silhouette is not defined for a single cluster).
Run a clustering algorithm (such as k-means or hierarchical clustering) for each k value.
For each point in the dataset, calculate its silhouette value using the formula (b-a)/max(a,b), where a is the mean distance to the other points in the same cluster, and b is the mean distance to the points in the closest other cluster.
Calculate the average silhouette value over all points in the dataset.
Plot the average silhouette value for each k value.
The k value that corresponds to the highest average silhouette value is the optimal number of clusters.
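A minimal sketch of the average silhouette method, using silhouette() from the cluster package with k-means as the clustering algorithm. The iris data and the range 2 to 10 are assumptions for illustration.

```r
library(cluster)  # provides silhouette()

# Stand-in data (assumption): scaled numeric columns of iris.
X <- scale(iris[, 1:4])
d <- dist(X)  # pairwise Euclidean distances

set.seed(123)
k_values <- 2:10  # the silhouette needs at least two clusters

# Average silhouette width over all observations, for each k.
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(X, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# k with the highest average silhouette width.
best_k <- k_values[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```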
When the Silhouette graph is analyzed, it can be observed that the highest silhouette value is in two clusters. However, 3 clusters can also be tried since there is not much difference between them.
The gap statistic is a technique used to determine the optimal number of clusters for a clustering analysis. It compares the observed within-cluster variation for different values of k with the variation expected under a null reference distribution of the data. [14]
The steps to perform the gap statistic method are:
Select a range of k values, usually from 1 to 10 or the square root of the number of observations in the dataset.
Run the clustering algorithm (such as k-means or hierarchical clustering) for each k value and calculate the within-cluster variation Wk.
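Generate B reference datasets by sampling uniformly over the range of the original data, and calculate the within-cluster variation W*kb for each reference dataset b.
Calculate the gap statistic as Gap(k) = (1/B) Σ_b log(W*kb) - log(Wk).
Plot the gap statistic for each k value.
The k value that corresponds to the maximum gap statistic is taken as the optimal number of clusters; a common refinement is to choose the smallest k for which Gap(k) ≥ Gap(k+1) - s(k+1), where s(k+1) is the standard error of the simulated values.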
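The procedure can be sketched with clusGap() and maxSE() from the cluster package. The iris data, K.max = 6, and the small B = 20 are assumptions chosen to keep the example fast; a real analysis would typically use a larger B.

```r
library(cluster)  # provides clusGap() and maxSE()

# Stand-in data (assumption): scaled numeric columns of iris.
X <- scale(iris[, 1:4])

set.seed(123)
# B reference datasets are simulated for every k from 1 to K.max;
# nstart is passed on to kmeans().
gap <- clusGap(X, FUNcluster = kmeans, K.max = 6, B = 20,
               nstart = 10, verbose = FALSE)

tab <- gap$Tab  # columns: logW, E.logW, gap, SE.sim
print(round(tab, 3))

# k with the largest gap value.
k_max_gap <- which.max(tab[, "gap"])

# Smallest k satisfying Gap(k) >= Gap(k+1) - s(k+1).
k_tibs <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "Tibs2001SEmax")

print(c(max_gap = k_max_gap, tibshirani = k_tibs))
```

The two selection rules do not always agree, which is one reason to compare the gap statistic with the elbow and silhouette results.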
The Gap Statistics value also indicates that the most appropriate number of clusters is two.
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 9 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 5 proposed 7 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
When the output of the NbClust package was analyzed, it was found that 9 methods suggested 2 clusters and 6 methods suggested 3 clusters. According to the majority rule, the best number of clusters is 2. However, both 2 and 3 clusters will be examined for the k-means, k-medoids, and hierarchical clustering algorithms.
## K-means clustering with 2 clusters of sizes 398, 171
##
## Cluster means:
## PC1 PC2
## 1 1.289695 -0.03214799
## 2 -3.001746 0.07482399
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1
## [38] 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 1
## [75] 1 1 1 2 2 1 1 1 2 2 1 2 1 2 1 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1
## [112] 1 2 1 1 1 1 2 2 1 1 2 2 1 1 1 1 2 2 2 1 2 2 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1
## [149] 1 1 1 1 2 1 1 1 2 1 1 1 1 2 2 1 2 1 1 1 2 1 1 1 2 1 1 1 1 2 1 1 2 2 1 1 1
## [186] 1 1 1 1 1 2 1 1 1 2 1 2 2 2 1 1 2 2 2 1 1 1 1 1 1 2 1 2 2 2 1 1 1 2 2 1 1
## [223] 1 2 1 1 1 1 1 2 2 1 1 2 1 1 2 2 1 2 1 1 1 1 2 1 1 1 1 1 2 1 2 2 2 1 2 2 2
## [260] 2 2 1 2 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 1 2 2 2 1 1
## [334] 1 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2
## [371] 2 1 2 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1
## [408] 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 2 1 1
## [445] 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1
## [482] 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2 2 1 2 1 2 2 1 1 1 1 2 1 1 2 1 1 1 2 2
## [519] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [556] 1 1 1 1 1 1 1 2 2 2 2 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 1121.768 1216.540
## (between_SS / total_SS = 48.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
When the result of the k-means clustering with 2 clusters is examined, the following is found:
There are 398 observations in cluster 1 and 171 observations in cluster 2.
The within-cluster sums of squares are 1121.768 and 1216.540.
It is best for the within-cluster sums of squares to be close to each other. In this case, they are very close.
This clustering result explains 48.5% of the total variance (between_SS / total_SS).
Separation can be observed only in the PC1 dimension. The within-cluster sum of squares of cluster 2 is slightly higher than that of cluster 1 even though cluster 2 contains far fewer observations, which means that cluster 2 is more dispersed. There is no visible overlap between clusters.
NOTE:
Cluster Validation of 2 clustered k-means is done outside of this report. It can be seen in the script. However, validation coefficients will be compared after all algorithms are analyzed.
## K-means clustering with 3 clusters of sizes 117, 117, 335
##
## Cluster means:
## PC1 PC2
## 1 -3.356223 1.1362390
## 2 -1.045541 -1.8948581
## 3 1.537332 0.2649505
##
## Clustering vector:
## [1] 2 1 1 2 1 2 1 2 2 2 3 2 1 3 2 2 3 2 1 3 2 3 2 1 1 2 2 1 2 1 1 2 1 1 2 1 2
## [38] 3 3 2 3 2 1 2 3 1 3 2 3 3 3 3 3 1 3 3 1 2 3 3 2 3 2 3 2 2 3 3 2 3 1 2 1 3
## [75] 3 3 2 1 1 3 3 2 1 1 3 1 3 1 3 2 3 3 3 3 2 1 3 3 3 2 3 3 3 3 3 2 3 3 1 3 3
## [112] 2 2 2 3 3 3 2 2 1 3 1 1 2 3 3 3 1 2 1 3 2 2 3 1 3 3 3 2 3 3 2 3 3 3 2 2 3
## [149] 3 3 2 2 2 3 3 3 1 3 3 3 2 1 1 3 1 3 3 3 1 3 3 3 2 3 3 3 2 1 3 3 1 1 3 3 3
## [186] 3 1 3 3 3 2 3 3 2 2 3 2 1 1 2 3 1 1 2 3 3 3 3 2 3 1 3 1 1 2 2 3 3 1 1 3 2
## [223] 3 2 3 3 3 3 3 2 1 3 3 1 3 3 1 1 3 1 3 3 2 3 1 3 3 3 3 3 1 3 1 1 1 2 1 2 2
## [260] 1 1 3 1 3 1 1 3 3 3 3 3 3 1 3 3 2 3 1 3 3 1 3 1 2 3 3 3 3 2 3 2 3 3 3 3 3
## [297] 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 3 2 1 3 1 3 3 3 3 2 2 2 3 3
## [334] 3 3 1 3 1 3 1 3 3 3 1 3 3 3 3 3 2 3 2 1 2 3 3 2 3 3 3 3 3 3 3 3 1 1 3 1 1
## [371] 1 3 1 1 3 2 2 3 3 2 2 3 3 3 3 3 3 3 3 1 3 3 2 1 3 3 3 3 3 3 1 3 3 3 3 3 3
## [408] 3 1 3 3 3 3 3 3 3 3 2 3 3 3 2 3 3 2 3 3 3 3 3 2 2 1 1 3 2 3 3 3 3 3 1 3 3
## [445] 1 3 1 3 3 1 3 1 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 1 2 3 3 3 3 3 3 3 3 3 2 3
## [482] 3 2 3 2 2 3 1 3 3 3 3 1 3 3 3 2 3 1 1 2 2 2 1 2 2 2 2 3 2 3 3 2 3 3 3 1 1
## [519] 2 2 2 1 3 3 3 3 3 3 2 3 3 3 3 1 3 1 2 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [556] 3 3 3 3 3 3 3 2 1 1 1 3 1 3
##
## Within cluster sum of squares by cluster:
## [1] 620.5682 455.6488 634.0446
## (between_SS / total_SS = 62.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
When the result of the k-means clustering with 3 clusters is examined, the following is found:
There are 117 observations in cluster 1, 117 observations in cluster 2, and 335 observations in cluster 3.
The within-cluster sums of squares are 620.5682, 455.6488, and 634.0446.
It is best for the within-cluster sums of squares to be close to each other. In this case, the WSS of cluster 2 is lower than that of the other clusters.
This clustering result explains 62.3% of the total variance (between_SS / total_SS).
There is no overlap between clusters.
Separation can be observed in both the PC1 and PC2 dimensions.
The within-cluster sum of squares is highest for cluster 3 (634.0446), closely followed by cluster 1 (620.5682); since cluster 1 has only 117 observations against 335 in cluster 3, cluster 1 is the most dispersed.
NOTES:
k-means clustering for k=5: Although the explanatory power is higher for 5 clusters (76.3%) than for 2 and 3 clusters, the average silhouette value is lower (0.36). The separation was achieved in both dimensions without overlap, and the cluster sizes are close to each other. However, considering the label and the cluster validity statistics, the decision was made not to include it in the report.
k-means clustering for k=7: Although the explanatory power is higher for 7 clusters (82.1%) than for 2 and 3 clusters, the average silhouette value is lower (0.35). The separation was achieved with little overlap in both dimensions, and the cluster sizes are even closer to each other. Although the difference between the within-cluster variances decreases, considering the label and the cluster validity statistics, the decision was made not to include it in the report.
K-medoids is a clustering algorithm that is similar to k-means, but instead of using the mean of the observations in each cluster as the centroid, it uses one of the observations in the cluster as the “medoid.” The main idea behind k-medoids is to define clusters where the total dissimilarity between observations and the medoid is minimized. The k-medoids algorithm is also known as Partitioning Around Medoids (PAM) algorithm. [15], [16]
The steps to perform k-medoids clustering are:
1. Select k, the number of clusters that you want to form in the data.
2. Select k random observations from the dataset as the initial medoids.
3. Assign each observation to the cluster whose medoid is closest to it based on a distance metric.
4. Recalculate the medoids as the observation in each cluster that minimizes the total dissimilarity to the other observations in the same cluster.
5. Repeat steps 3 and 4 until the cluster assignments no longer change or a maximum number of iterations is reached.
It’s important to note that k-medoids is more robust to noise and outliers than k-means, and because it only requires a dissimilarity between observations, it can also be applied to categorical or mixed variables. However, k-medoids is more computationally expensive than k-means because it works with the pairwise distances between observations. Like k-means, k-medoids can be sensitive to the initial conditions, so it’s recommended to run the algorithm multiple times and choose the best solution.
After applying the k-medoids algorithm, the resulting output will be k clusters where each cluster has its own medoid, and each observation will be assigned to the cluster to which it is closest. These clusters can be used for further analysis or interpretation of the data.
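A minimal sketch with pam() from the cluster package shows where the medoids and the cluster sizes are stored in the result. The iris data and k = 3 are assumptions for illustration. Note that the ID column of the printed medoids holds observation indices, not cluster sizes.

```r
library(cluster)  # provides pam()

# Stand-in data (assumption): scaled numeric columns of iris.
X <- scale(iris[, 1:4])

fit <- pam(X, k = 3)

# Row indices of the observations chosen as medoids.
print(fit$id.med)

# Coordinates of the medoids.
print(fit$medoids)

# Cluster sizes come from clusinfo (or from tabulating the assignments).
sizes <- fit$clusinfo[, "size"]
print(sizes)
print(table(fit$clustering))
```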
When the Elbow Method graph is analyzed, it can be said that it is not possible to make a definite decision for the number of clusters, but two clusters can be selected.
When the Silhouette graph is analyzed, it can be observed that the highest silhouette value is in two clusters.
The Gap Statistics value also indicates that the most appropriate number of clusters is two.
Based on the three methods considered, it was concluded that the optimal number of clusters is two. Trials were made for 2, 3, 4, and 5 clusters in the analysis. However, only the analyses for 2 and 3 clusters are included in the report.
In the analysis for 4 clusters, the separation took place in both dimensions. The variance within two of the clusters was found to be low, while the variance of the other two clusters was found to be disproportionately high. Because the Silhouette value (0.34) was lower than for the solutions with fewer clusters, and considering the label, the decision was made not to include this analysis in the report.
In the analysis for 5 clusters, the separation was also observed in both dimensions. Again, the variance within two clusters was higher compared to the other three clusters. The Silhouette value continued to decrease (0.33) and, considering the label, the decision was not to include this analysis with this number of clusters in the report.
## Medoids:
## ID PC1 PC2
## [1,] 499 -2.357211 0.30131315
## [2,] 269 1.358672 0.03762238
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2
## [38] 2 2 2 2 2 1 2 2 1 2 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 1 2 2 1 2 2 2 2 1 2 1 2
## [75] 2 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2
## [112] 2 1 2 2 2 2 1 1 1 2 1 1 2 2 2 2 1 1 1 2 1 1 2 1 2 2 2 1 2 2 1 2 2 2 2 1 2
## [149] 2 2 2 2 1 2 2 2 1 2 2 2 2 1 1 2 1 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2
## [186] 2 1 2 2 2 1 2 2 2 1 2 1 1 1 1 2 1 1 1 2 2 2 1 2 2 1 2 1 1 1 1 2 2 1 1 2 2
## [223] 2 1 2 2 2 2 2 1 1 2 2 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1 1 1 2 1 1 1
## [260] 1 1 2 1 2 1 1 2 2 2 2 2 2 1 2 1 2 2 1 2 2 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2
## [297] 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 1 1 1 2 2
## [334] 2 2 1 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1
## [371] 1 2 1 1 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2
## [408] 2 1 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 2 1 2 2
## [445] 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2
## [482] 2 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2 2 1 1 2 1 2 1 1 2 2 2 2 1 2 2 1 2 2 2 1 1
## [519] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [556] 2 2 2 2 2 2 2 1 1 1 1 1 1 2
## Objective function:
## build swap
## 1.806580 1.700399
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
The values 499 and 269 in the output are the IDs of the medoid observations, not the cluster sizes. The clustering vector shows that cluster 2 contains considerably more observations than cluster 1, so the clusters are unbalanced.
No overlap is observed when the two- and three-dimensional graphs are analyzed. Just as in k-means, the separation occurs only in the PC1 dimension. The variance in the first cluster, shown in red, is higher.
## Medoids:
## ID PC1 PC2
## [1,] 169 -2.8020980 0.6466461
## [2,] 107 0.7507288 -1.3117501
## [3,] 296 1.6697824 0.6918928
## Clustering vector:
## [1] 1 1 1 2 1 2 1 2 2 2 3 1 1 3 1 1 3 1 1 3 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2
## [38] 3 3 2 3 2 1 2 2 1 3 2 2 3 3 3 3 1 3 3 1 2 3 2 2 2 1 2 2 1 2 3 2 3 1 2 1 2
## [75] 3 3 2 1 1 3 2 2 1 1 2 1 2 1 2 2 3 1 3 3 1 1 2 3 3 2 3 2 3 2 2 2 2 3 1 3 2
## [112] 2 1 2 2 3 2 1 1 1 3 1 1 2 3 3 3 1 1 1 2 1 1 3 1 3 3 3 1 2 3 1 2 3 3 2 2 3
## [149] 2 3 2 2 2 3 2 3 1 3 3 3 2 1 1 2 1 3 3 1 1 3 2 3 1 3 3 3 2 1 3 3 1 1 3 3 3
## [186] 3 1 3 3 3 1 3 3 2 1 3 2 1 1 2 2 1 1 2 2 3 2 3 2 3 1 3 1 1 2 2 2 3 1 1 3 2
## [223] 2 1 3 3 2 3 3 2 1 3 3 1 3 3 1 1 3 1 3 3 2 3 1 2 3 2 2 2 1 3 1 1 1 2 1 1 1
## [260] 1 1 3 1 3 1 1 2 3 3 2 3 2 1 2 3 2 3 1 3 2 1 3 1 1 3 3 3 3 2 3 2 3 2 3 3 3
## [297] 3 3 3 2 1 3 1 2 3 3 3 3 3 3 3 3 3 3 2 3 3 1 2 3 2 1 2 1 3 3 3 3 1 1 1 2 2
## [334] 3 3 1 2 1 2 1 2 2 2 1 2 2 3 3 3 2 3 1 1 1 3 3 2 3 2 2 3 3 3 3 3 1 1 3 1 1
## [371] 1 3 1 1 3 2 2 3 3 2 2 3 3 2 3 3 3 3 2 1 2 2 1 1 2 3 2 3 3 3 1 3 3 3 3 3 3
## [408] 3 1 3 3 2 3 3 3 2 2 1 3 3 3 2 2 2 2 3 2 3 3 3 1 2 1 1 3 2 3 3 3 3 2 1 3 3
## [445] 1 2 1 3 3 1 3 1 3 2 3 3 3 3 3 3 1 1 3 3 3 3 3 3 1 2 2 3 3 3 2 3 3 3 2 1 3
## [482] 3 2 3 2 2 3 1 2 3 3 3 1 3 3 3 2 3 1 1 2 2 2 1 2 2 2 2 3 1 3 3 2 3 3 2 1 1
## [519] 2 2 2 1 3 2 3 2 2 3 2 2 2 2 3 1 2 1 2 2 2 2 2 2 3 3 3 3 3 2 3 3 3 2 3 3 3
## [556] 3 2 3 3 3 3 3 1 1 1 1 1 1 3
## Objective function:
## build swap
## 1.537794 1.437532
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
There are 159 observations in cluster 1, 169 observations in cluster 2, and 241 observations in cluster 3, which is also unbalanced.
When the cluster graph is analyzed, it can be seen that there is no overlap. It can be seen that the separation occurs in both PC1 and PC2 dimensions. While the variance in the first cluster shown in red is high, the variance in the third cluster shown in blue is low.
Hierarchical Clustering is a method of clustering in which the objects are organized into a tree-like structure called a dendrogram. The main idea behind hierarchical clustering is to start with each object as a separate cluster and then combine them into larger clusters iteratively based on their similarity. There are two main types of hierarchical clustering: Agglomerative and Divisive. [17], [18], [19]
Agglomerative hierarchical clustering:
1. Start with each object as a separate cluster.
2. Find the two most similar clusters and combine them into a new cluster.
3. Repeat step 2 until all objects are in the same cluster.
Divisive hierarchical clustering:
1. Start with all objects in the same cluster.
2. Divide the largest cluster into two smaller clusters based on their similarity.
3. Repeat step 2 until each object forms its own cluster.
Hierarchical clustering can be represented by a dendrogram, which is a tree-like structure that shows the hierarchy of clusters and the relations between them. The dendrogram can be cut at a certain height to obtain a flat clustering solution with a specific number of clusters.
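As a minimal sketch of the agglomerative procedure and of cutting the dendrogram, the following uses base R's `hclust()` and `cutree()` on a small simulated dataset. The toy data, the object names, and the choice of average linkage are illustrative assumptions and are not part of the analysis in this report.

```r
# Toy data (assumption): two well-separated groups of 10 points in 2-D.
set.seed(42)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2)
)

# Scale first, because hierarchical clustering is distance-based.
toy_scaled <- scale(toy)

# Agglomerative clustering: every point starts as its own cluster and the
# two closest clusters are merged at each step.
hc_toy <- hclust(dist(toy_scaled, method = "euclidean"), method = "average")

# n observations always produce n - 1 merges.
n_merges <- nrow(hc_toy$merge)

# Cut the dendrogram to obtain a flat solution with two clusters.
groups <- cutree(hc_toy, k = 2)
table(groups)
```

Calling `plot(hc_toy)` would draw the dendrogram; `cutree(hc_toy, h = ...)` cuts at a height instead of asking for a fixed number of clusters.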
Hierarchical clustering is sensitive to the scale of the variables, so the data should be scaled before the method is applied. The choice of linkage method (single, complete, average, Ward, etc.) also affects the final clustering. In addition, hierarchical clustering is computationally expensive for large datasets and tends to perform poorly on high-dimensional data.
Hierarchical clustering starts with Ward’s linkage method. Clustering is performed with both Euclidean and Manhattan distance metrics, and the dendrograms are visualized. The cophenetic distances of each clustering are then computed, and the correlation between the original distances and the cophenetic distances is used to decide which distance metric to proceed with.
Ward’s method is an agglomerative linkage method used in hierarchical clustering. At each step it merges the pair of clusters whose union gives the smallest increase in total within-cluster variance, that is, in the sum of squared distances between the observations and their cluster centroid.
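The merge cost that Ward's method minimizes can be checked numerically. The sketch below, which uses two hand-made toy clusters (an assumption, not data from this report), compares the directly computed increase in within-cluster sum of squares with the well-known closed form n_A n_B / (n_A + n_B) times the squared distance between the two centroids.

```r
# Within-cluster sum of squared distances to the centroid.
sse <- function(m) sum(sweep(m, 2, colMeans(m))^2)

# Toy clusters (assumption): A has 3 points, B has 2 points, both in 2-D.
A <- matrix(c(0, 0,
              1, 0,
              0, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(4, 4,
              5, 5), ncol = 2, byrow = TRUE)

# Increase in total within-cluster sum of squares caused by merging A and B.
increase_direct <- sse(rbind(A, B)) - sse(A) - sse(B)

# Closed form of the same quantity:
# n_A * n_B / (n_A + n_B) * squared distance between the centroids.
n_a <- nrow(A)
n_b <- nrow(B)
increase_formula <- n_a * n_b / (n_a + n_b) *
  sum((colMeans(A) - colMeans(B))^2)

c(direct = increase_direct, formula = increase_formula)
```

The two numbers agree, which is why Ward's method tends to merge clusters that are both small and close to each other.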
The cophenetic distance is a measure used in hierarchical clustering to describe how similar two observations are according to the dendrogram produced by the clustering algorithm. It is defined as the height in the dendrogram at which the two observations are first merged into the same cluster [20].
The cophenetic distance is calculated as follows:
1. Perform hierarchical clustering on the data to produce a dendrogram.
2. For each pair of observations, find the level in the dendrogram where they first merge into the same cluster.
3. Take the height of that merge as the cophenetic distance of the pair. Repeat steps 2 and 3 for all pairs of observations.

The cophenetic distances are used to evaluate the quality of the clustering solution by comparing them with the original distances between observations. A high correlation between the two indicates that the dendrogram preserves the structure of the data well.
Computing cophenetic distances is expensive for large datasets because every pair of observations is involved. The linkage method also affects the result, so it is advisable to compare several linkage methods and to visualize the data.
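To make the definition concrete, the following sketch computes cophenetic distances with base R's `cophenetic()` on five points placed on a line. The toy data and the choice of average linkage are assumptions made for illustration only.

```r
# Toy data (assumption): 5 points on a line, so the distances are easy to read.
toy <- matrix(c(0, 1, 2, 10, 11), ncol = 1)
d_orig <- dist(toy)

hc_toy <- hclust(d_orig, method = "average")

# Cophenetic distance of a pair = height of the merge that first joins them.
d_coph <- cophenetic(hc_toy)

# Points 1 and 4 only meet at the final merge, so their cophenetic distance
# equals the height of the root of the dendrogram.
root_height <- max(hc_toy$height)
coph_1_4 <- as.matrix(d_coph)[1, 4]

# Cophenetic correlation: how well the dendrogram preserves the distances.
coph_cor <- cor(d_orig, d_coph)
```

Here `coph_1_4` equals `root_height`, although the original distance between points 1 and 4 is 10: the cophenetic distance depends on the dendrogram and not directly on the data space.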
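dist_euc <- dist(pcadata, method="euclidean")
dist_man <- dist(pcadata, method="manhattan")
# The hclust() calls are not shown in the report; Ward linkage via
# method = "ward.D2" is assumed here from the "Ward.D2" labels used later.
hc_e <- hclust(dist_euc, method="ward.D2")
hc_m <- hclust(dist_man, method="ward.D2")
# Cophenetic correlation for Euclidean distance
coph_e <- cophenetic(hc_e)
cor(dist_euc,coph_e)
## [1] 0.6711685
# Cophenetic correlation for Manhattan distance
coph_m <- cophenetic(hc_m)
cor(dist_man,coph_m)
## [1] 0.6018289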
The cophenetic correlation is higher for Euclidean distance (0.67) than for Manhattan distance (0.60), so hierarchical clustering with Euclidean distance preserves the original distances better.
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 7 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 5 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
When the output of the NbClust package was analyzed, 7 indices suggested 2 clusters, while 3 clusters and 6 clusters were each suggested by 5 indices. According to the majority rule, the best number of clusters is 2. However, both 2 and 3 clusters will be examined for the k-means, k-medoids, and hierarchical clustering algorithms.
In the analysis with 6 clusters, the separation occurred in both dimensions, but some clusters overlapped. Although the clusters had similar numbers of elements, the variance within two clusters was higher than in the other four. Given the lower silhouette value (0.32) and the comparison with the labels, the 6-cluster analysis was not included in the report.
## grupward2
## 1 2
## 180 389
There are 180 observations in cluster 1, 389 observations in cluster 2.
When the cluster graph is analyzed, overlap can be observed. It can be seen that the separation occurs only in PC1. While the variance in the first cluster shown in red is high, the variance in the second cluster shown in blue is low.
## grupward3
## 1 2 3
## 104 76 389
There are 104 observations in cluster 1, 76 observations in cluster 2, and 389 observations in cluster 3.
When the cluster graph is analyzed, overlap can be observed. The separation occurs in both PC1 and PC2. While the variance in the first cluster shown in red is high, the variance in the second cluster shown in green is low.
The average linkage method (also known as UPGMA) is an agglomerative linkage method used in hierarchical clustering. It is based on the idea of minimizing the average distance between observations in the two clusters being merged. The average linkage method is a measure of the dissimilarity between two clusters, defined as the average distance between the points in one cluster and the points in the other.[21], [22]
The steps to perform hierarchical clustering using average linkage method are:
1. Start with each observation as a separate cluster.
2. Compute the distance matrix between all pairs of clusters.
3. Merge the two clusters that have the minimum average distance between their observations and form a new cluster.
4. Repeat steps 2 and 3 until all observations are in the same cluster.

The average linkage method is sensitive to the scale of the variables, so it is recommended to standardize the variables before applying the method. Average linkage is a compromise between single linkage, which tends to create elongated, chain-like clusters, and complete linkage, which favors compact clusters. Like other agglomerative methods, it is better suited to datasets with a small number of observations and variables.
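The average-linkage dissimilarity between two clusters can be computed by hand and compared with the merge height reported by `hclust()`. The two toy clusters below are assumptions chosen so that each one is merged internally before the two are joined.

```r
# Toy clusters (assumption): two small groups of points in 2-D.
A <- matrix(c(0, 0,
              0, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(3, 0,
              3, 1,
              4, 0), ncol = 2, byrow = TRUE)

# All pairwise Euclidean distances between a point in A and a point in B.
all_d <- as.matrix(dist(rbind(A, B)))
cross_d <- all_d[1:nrow(A), nrow(A) + 1:nrow(B)]

# Average linkage dissimilarity = mean of those cross-cluster distances.
avg_link <- mean(cross_d)

# hclust() with method = "average" merges A and B last, at the same height.
hc_toy <- hclust(dist(rbind(A, B)), method = "average")
final_height <- max(hc_toy$height)

c(by_hand = avg_link, hclust = final_height)
```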
## [1] 0.8013459
## [1] 0.7550863
The cophenetic correlation is again higher for Euclidean distance (0.80) than for Manhattan distance (0.76), so hierarchical clustering with Euclidean distance gives better results.
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 6 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 7 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 4 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 5 .
When the output of the NbClust package was analyzed, 7 indices suggested 5 clusters, 6 indices suggested 2 clusters, and 3 indices suggested 3 clusters. According to the majority rule, the best number of clusters is 5. However, the 2-cluster and 3-cluster solutions are examined below.
In the clustering with 7 clusters, the cluster sizes differed greatly: some clusters had very few elements (3), while others had many (305). The within-cluster variances were also imbalanced. Based on the average silhouette value (0.41) and other metrics, this solution was not included in the report.
## grupav2
## 1 2
## 23 546
There are 23 observations in cluster 1, 546 observations in cluster 2. It can easily be seen that clusters are unbalanced.
When the cluster graph is analyzed, overlap can be observed. It can be seen that the separation occurs only in PC1. While the variance in the second cluster shown in green is high, the variance in the first cluster shown in blue is low.
## grupav3
## 1 2 3
## 23 541 5
There are 23 observations in cluster 1, 541 observations in cluster 2, and 5 observations in cluster 3. It can easily be seen that clusters are unbalanced.
When the cluster graph is analyzed, overlap can be observed. The separation occurs in both PC1 and PC2. While the variance in the second cluster shown in green is high, the variance in the third cluster shown in blue is low.
Model-based clustering is a method of clustering in which a probabilistic model is fit to the data, and the clusters are defined by the components of that model. The main idea is to assume that the data are generated by a mixture of probability distributions, with each cluster corresponding to one component of the mixture.
There are several types of model-based clustering methods such as:
Gaussian Mixture Model (GMM): assumes that the data is generated by a mixture of Gaussian distributions, and estimates the parameters of the distributions, such as means and covariances, to define the clusters. I will use this version in this analysis.
Latent Dirichlet Allocation (LDA): a generative probabilistic model used to classify text in natural language processing and information retrieval. It assumes that each document is a mixture of topics and each topic is a mixture of words.
Hidden Markov Model (HMM): a statistical model used to predict a sequence of hidden states from a sequence of observations. It can be used for clustering sequences of data.
Model-based clustering methods have some advantages over traditional clustering methods, such as the ability to model complex data distributions and handle missing data. However, they are sensitive to the initial conditions and to the chosen number of clusters, and they are computationally expensive for large datasets.
NOTES:
Model-based clustering estimates its parameters with the EM algorithm. Each cluster is centered on its mean, and its density increases closer to the mean. The argument G of the function sets the number of clusters; values from 1 to 9 were tested, and the best result was obtained with G = 2.
For G = 3, the first two cluster sizes were close to each other (235 and 212), but the third cluster was smaller (122). The best model was VII, a spherical model with varying volume and equal shape. This solution was not included in the report because the number of uncertain observations in the uncertainty graph was high and the average silhouette value was lower (0.27).
For G = 4, the cluster sizes were 52, 158, 213, and 146, respectively. The best model was again VII, a spherical model with varying volume and equal shape. This solution was not included in the report because the number of uncertain observations in the uncertainty graph was high and the average silhouette value was lower (0.33).
For G = 5, the cluster sizes were 29, 112, 187, 114, and 127, respectively. The best model was again VII, a spherical model with varying volume and equal shape. This solution was not included in the report because the number of uncertain observations in the uncertainty graph was high and the average silhouette value was lower (0.32).
mc <- Mclust(pcadata, G=2)
summary(mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 2 components:
##
## log-likelihood n df BIC ICL
## -2226.677 569 9 -4510.449 -4654.213
##
## Clustering table:
## 1 2
## 253 316
The number of observations in each cluster is as follows: C1: 253, C2: 316.
The best model is VVI, a diagonal model with varying volume and shape whose axes are aligned with the coordinate axes.
fviz_mclust(mc, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
When the clustering graph is examined, no overlaps are observed. Blue dots are clearly visible at the far right of the PC1 axis, which is an interesting result. Separation occurred only in the PC1 dimension.
fviz_mclust(mc, "uncertainty", palette = "jco",pos = FALSE)
Larger points in the uncertainty graph mark observations whose cluster assignment is more uncertain. Uncertainty increases in the region between the two clusters, which is not surprising.
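The EM estimation and the uncertainty measure can be illustrated with a deliberately small base R sketch. It fits a two-component Gaussian mixture to simulated one-dimensional data, which is an assumption made for readability: it is not the `Mclust()` implementation, it uses a fixed number of iterations instead of a convergence check, and all object names are hypothetical. Uncertainty is taken as one minus the largest posterior membership probability of each observation.

```r
# Toy data (assumption): a 1-D sample drawn from two Gaussian groups.
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

# Starting values (assumption): means at the lower and upper quartiles.
mu <- unname(quantile(x, c(0.25, 0.75)))
sigma <- rep(sd(x), 2)
prop <- c(0.5, 0.5)

# Fixed number of EM iterations, for simplicity.
for (iter in 1:200) {
  # E-step: posterior probability that each point belongs to each component.
  dens <- cbind(prop[1] * dnorm(x, mu[1], sigma[1]),
                prop[2] * dnorm(x, mu[2], sigma[2]))
  z <- dens / rowSums(dens)

  # M-step: update mixing proportions, means and standard deviations.
  n_k <- colSums(z)
  prop <- n_k / length(x)
  mu <- colSums(z * x) / n_k
  sigma <- sqrt(colSums(z * outer(x, mu, "-")^2) / n_k)
}

# Hard assignment and uncertainty (1 minus the largest posterior probability).
cluster <- apply(z, 1, which.max)
uncertainty <- 1 - apply(z, 1, max)

# The most uncertain observations lie between the two estimated means.
head(x[order(uncertainty, decreasing = TRUE)])
```

With two components the uncertainty can never exceed 0.5, and it is largest for points roughly halfway between the component means, which matches the pattern described for the uncertainty graph.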
Density-based clustering is a type of clustering algorithm that groups together data points that are closely packed together, while separating those that are more sparsely distributed. The main idea behind density-based clustering is to identify regions in the feature space where the data points are dense, and then to extract clusters based on these regions. [25]
One commonly used density-based clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN groups together data points that are close to each other based on a distance measure and a density threshold. It defines clusters as dense regions of points that are separated from other dense regions by regions of lower point density. [26]
Another example of density-based clustering is HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), an extension of the DBSCAN algorithm. It can discover clusters of varying density, shape, and size, and it is less sensitive to parameter tuning.
Density-based clustering is useful for data sets that contain clusters of different shapes and sizes, and for data sets with noise and outliers.
In density-based clustering, the number of clusters does not need to be predetermined, but the values of MinPts and eps do. The eps parameter defines the radius of the neighborhood around a point x, called the epsilon neighborhood of x. The MinPts parameter is the minimum number of neighbors within the “eps” radius. A k-nearest-neighbor distance plot (kNNdistplot) can be used to determine these values.
Here k stands for MinPts. After several trials, 10 was decided upon. When analyzing the kNNdistplot, the point where the line makes an “elbow” should be identified, just as in the Elbow Method, and chosen as the eps value. After various trials, the most appropriate value was decided to be 0.6.
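The quantity shown in a k-nearest-neighbor distance plot can be computed with base R alone. The sketch below uses simulated data and `k = 5` purely as assumptions; it does not call the `dbscan` package, and the object names are hypothetical.

```r
# Toy data (assumption): one dense group plus a few scattered points.
set.seed(7)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),  # 50 dense points
  matrix(runif(10, min = 3, max = 6), ncol = 2)      # 5 scattered points
)

k <- 5  # plays the role of MinPts in the text

# Distance from each point to its k-th nearest neighbour (self excluded:
# after sorting a row, position 1 is the point's zero distance to itself).
d <- as.matrix(dist(toy))
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-NN distances: this is the curve whose "elbow" suggests eps.
sorted_knn <- sort(knn_dist)
tail(sorted_knn)
```

Plotting `sorted_knn` as a line gives the elbow curve: points in the dense group have small k-NN distances, while scattered points produce the steep rise at the right end, and eps is usually chosen just before that rise.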
NOTES:
In the analysis with MinPts 10 and eps 1, only one cluster was found. Since a single cluster provides no partition of the data, it was not included in the report.
The analysis with MinPts 10 and eps 0.8 gave the same result as the analysis with eps 1, so it was not included in the report either.
In the analysis with MinPts 5 and eps 0.6, two clusters were found. The first cluster contained 513 observations and the second only 7, and the variance differences caused by this imbalance raised suspicion. For this reason, it was not included in the report.
## dbscan Pts=569 MinPts=10 eps=0.6
## 0 1 2
## border 96 58 27
## seed 0 49 339
## total 96 107 366
Density-based clustering divided the dataset into two clusters. The output shows a total of 96 noise values. There are 58 border points in the first cluster and 27 in the second cluster. There are 49 seed points in the first cluster and 339 seed points in the second cluster.
When the graph is examined, it can be seen that the element difference between the clusters is small. The excess of noise values is also noteworthy.
Before sharing the validity measurements made after clustering, I wanted to use the clValid function in the clValid package. This function performs clustering with the given clustering methods and recommends the most appropriate clustering algorithm and number of clusters. This function will be used for three validity criteria. These criteria are internal and external cluster validity and clustering stability.
##
## Clustering Methods:
## kmeans pam hierarchical model
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## kmeans Connectivity 47.9456 60.6964 78.5956 94.8873 91.6552
## Dunn 0.0058 0.0074 0.0136 0.0128 0.0125
## Silhouette 0.4923 0.4417 0.4194 0.3631 0.3566
## pam Connectivity 33.4456 73.8468 88.6940 103.3563 117.7881
## Dunn 0.0133 0.0046 0.0072 0.0067 0.0112
## Silhouette 0.4804 0.3627 0.3369 0.3341 0.2982
## hierarchical Connectivity 10.0647 17.1679 20.6960 38.8679 43.0341
## Dunn 0.0637 0.0719 0.0719 0.0294 0.0294
## Silhouette 0.5363 0.4703 0.4538 0.4200 0.4149
## model Connectivity 60.9706 67.5242 96.2040 105.0313 112.6198
## Dunn 0.0023 0.0055 0.0057 0.0066 0.0041
## Silhouette 0.4125 0.2706 0.3351 0.3264 0.2854
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 10.0647 hierarchical 2
## Dunn 0.0719 hierarchical 3
## Silhouette 0.5363 hierarchical 2
The internal clustering validity criteria include Connectivity, Dunn, and Silhouette. The clValid function indicates hierarchical clustering as the most appropriate algorithm on all three criteria. The optimal number of clusters is 2 according to Connectivity and Silhouette and 3 according to Dunn.
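Of the three internal criteria, the Dunn index is the simplest to compute by hand: it is the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster, so larger values are better. The function name, the toy data, and the two labelings below are illustrative assumptions, not the clValid implementation.

```r
# Dunn index = smallest distance between points of different clusters
#              divided by the largest distance within a single cluster.
dunn_index <- function(d, labels) {
  d <- as.matrix(d)
  same <- outer(labels, labels, "==")
  min_between <- min(d[!same])
  max_within <- max(d[same])
  min_between / max_within
}

# Toy data (assumption): two compact, well-separated groups on a line.
toy <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
labels_good <- c(1, 1, 1, 2, 2, 2)   # matches the two groups
labels_bad  <- c(1, 2, 1, 2, 1, 2)   # mixes the two groups

dunn_good <- dunn_index(dist(toy), labels_good)
dunn_bad  <- dunn_index(dist(toy), labels_bad)

c(good = dunn_good, bad = dunn_bad)
```

For the good labeling the closest points of different clusters are 8 apart and the widest cluster has diameter 2, giving a Dunn index of 4; the mixed labeling scores far lower.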
clustervalid <- data.frame( Clustering.Algorithm = c("2k-means", "3k-means", "2k-medoids", "3k-medoids", "Ward.D2-2", "Ward.D2-3", "Average-5", "Average-2", "Model.Based", "Density.Based"),
Cluster.Number = c(2,3,2,3,2,3,5,2,2,3),
Overlap = c("little", "much", "little", "much", "little", "much", "much", "little", "none", NA),
Negative.Silhouette.Number = c(10,9,26,47,24,30,52,29, NA, NA),
Average.Silhouette.Number = c(0.49, 0.44, 0.48, 0.36,0.48,0.48,0.42,0.54, 0.41,0.14),
Dunn.Index = c(0.005,0.011, 0.013, 0.004, 0.02, 0.035, 0.029, 0.063, 0.002, 0.019),
Connectivity = c(64.96, 87.85, 50.08, 109.02, 40.70, 60.24, 68.88, 20.61, 83.61, 117.14),
Rand = c(0.64, 0.49, 0.72, 0.39, 0.56, 0.51, 0.44, 0.60, 0.53, 0.09),
VI = c(0.56, 0.93, 0.49, 1.013189, 0.70, 0.86, 0.80, 0.74, 0.75, NA),
Label = c(55,146,42,194,70,127,189,111,77,142)
)
clustervalid
## Clustering.Algorithm Cluster.Number Overlap Negative.Silhouette.Number
## 1 2k-means 2 little 10
## 2 3k-means 3 much 9
## 3 2k-medoids 2 little 26
## 4 3k-medoids 3 much 47
## 5 Ward.D2-2 2 little 24
## 6 Ward.D2-3 3 much 30
## 7 Average-5 5 much 52
## 8 Average-2 2 little 29
## 9 Model.Based 2 none NA
## 10 Density.Based 3 <NA> NA
## Average.Silhouette.Number Dunn.Index Connectivity Rand VI Label
## 1 0.49 0.005 64.96 0.64 0.560000 55
## 2 0.44 0.011 87.85 0.49 0.930000 146
## 3 0.48 0.013 50.08 0.72 0.490000 42
## 4 0.36 0.004 109.02 0.39 1.013189 194
## 5 0.48 0.020 40.70 0.56 0.700000 70
## 6 0.48 0.035 60.24 0.51 0.860000 127
## 7 0.42 0.029 68.88 0.44 0.800000 189
## 8 0.54 0.063 20.61 0.60 0.740000 111
## 9 0.41 0.002 83.61 0.53 0.750000 77
## 10 0.14 0.019 117.14 0.09 NA 142
The clustering with the highest average silhouette value (0.54) is the average linkage method in hierarchical clustering with two clusters.
The clustering with the highest Dunn value (0.063) is also the average linkage method in hierarchical clustering with two clusters.
Connectivity takes values from 0 to infinity and should be as small as possible. The clustering with the lowest connectivity value (20.61) is the average linkage method in hierarchical clustering with two clusters.
The Rand index takes values between -1 (no agreement) and 1 (perfect agreement), so the value closest to 1 is best. Among all the methods tested, the k-medoids algorithm with 2 clusters has the value closest to one (0.72).
Meila's variation of information (VI) is non-negative and equals 0 when the clustering matches the labels exactly, so the smallest value is best. Among all the VI values, k-medoids with 2 clusters is the best (0.49).
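The two external measures can be sketched in base R from a pair of label vectors. The code below implements the plain (uncorrected) Rand index and Meila's variation of information with natural logarithms; the values in the table above may come from a corrected variant or a different log base, so this is an illustration of the definitions under those assumptions and not a reproduction of the reported numbers. Function names and toy labels are hypothetical.

```r
# Rand index: share of observation pairs on which two partitions agree
# (both put the pair together, or both keep it apart).
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)
  mean(same_a[pairs] == same_b[pairs])
}

# Variation of information: H(a) + H(b) - 2 * I(a, b), using natural logs.
variation_of_information <- function(a, b) {
  joint <- unclass(table(a, b)) / length(a)
  p_a <- rowSums(joint)
  p_b <- colSums(joint)
  entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
  expected <- outer(p_a, p_b)
  keep <- as.vector(joint > 0)
  mutual <- sum(as.vector(joint)[keep] *
                log(as.vector(joint)[keep] / as.vector(expected)[keep]))
  entropy(p_a) + entropy(p_b) - 2 * mutual
}

# Toy labels (assumption): 'truth' stands in for the diagnosis labels.
truth      <- c("M", "M", "M", "B", "B", "B")
same_split <- c(2, 2, 2, 1, 1, 1)   # same partition, different label names
one_off    <- c(2, 2, 1, 1, 1, 1)   # one observation misplaced

rand_same <- rand_index(truth, same_split)
vi_same   <- variation_of_information(truth, same_split)
rand_off  <- rand_index(truth, one_off)
vi_off    <- variation_of_information(truth, one_off)
```

A perfect match gives a Rand index of 1 and a VI of 0; misplacing one observation lowers the Rand index and raises the VI, which is why the two measures are read in opposite directions.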
ggplot(clustervalid, aes(x = Label , y = Clustering.Algorithm )) +
geom_bar(stat = "identity", width = 0.1, color="burlywood4", fill = "burlywood") +
theme_minimal()+
labs(title = "Label")+
xlab("Label") +
ylab("Clustering Algorithm") +
theme(axis.text.y = element_text(angle=360, vjust=.5, hjust=1))
The graph above shows the difference between the label frequencies and the cluster frequencies. The clustering algorithm with the smallest difference is the one whose clusters are closest to the labels. For this reason, k-medoids with 2 clusters is seen as the most appropriate algorithm.
According to the recommendations of clValid and all other metrics:
Under normal circumstances, the most appropriate clustering method would have been hierarchical clustering with 2 clusters. However, for this dataset and this analysis, considering the differences in cluster sizes, the extreme differences in within-cluster variances, and the fact that the labels are known, it is more appropriate to choose the algorithm with the smallest frequency difference from the labels. In other words, k-medoids was selected as the optimal clustering algorithm and 2 as the optimal number of clusters. k-medoids misassigned only 42 of the 569 observations. With an error rate of about 7%, it can be considered a very successful clustering algorithm.
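The error rate quoted above follows directly from the two counts in the text. The short sketch below reproduces that arithmetic and adds a hypothetical helper, `count_mismatches()`, showing one way such a count could be obtained for a two-cluster solution; how the report actually computed the Label column is not shown, so the helper is an assumption.

```r
# Figures taken from the text: 42 mismatches out of 569 observations.
mismatches <- 42
n_obs <- 569
error_rate <- mismatches / n_obs
round(100 * error_rate, 1)   # 7.4

# Hypothetical helper (assumption): count mismatches between a two-cluster
# solution and binary labels, trying both ways of naming the clusters.
count_mismatches <- function(cluster, label) {
  tab <- table(cluster, label)
  agree <- max(tab[1, 1] + tab[2, 2], tab[1, 2] + tab[2, 1])
  sum(tab) - agree
}

# Toy example (assumption): one of five observations is in the wrong cluster.
toy_mismatch <- count_mismatches(c(1, 1, 2, 2, 2), c("M", "M", "B", "B", "M"))
toy_mismatch   # 1
```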
The means of each variable were compared with the cluster averages. The table below is drawn to illustrate this comparison.
| Variables | First Cluster | Second Cluster |
|---|---|---|
| Radius | High | Average |
| Texture | High | Average |
| Perimeter | High | Low |
| Area | High | Low |
| Smoothness | High | Low |
| Compactness | High | Low |
| Concavity | High | Low |
| Concave Points | High | Average |
| Symmetry | High | Average |
| Fractal Dimension | Average | Average |
##
## Descriptive statistics by group
## Diagnosis: B
## vars n mean sd median trimmed mad min max
## radius 1 357 12.15 1.78 12.20 12.17 1.69 6.98 17.85
## texture 2 357 17.91 4.00 17.39 17.52 3.47 9.71 33.81
## perimeter 3 357 78.08 11.81 78.18 78.16 11.13 43.79 114.60
## area 4 357 462.79 134.29 458.40 459.40 127.06 143.50 992.10
## smoothness 5 357 0.09 0.01 0.09 0.09 0.01 0.05 0.16
## compactness 6 357 0.08 0.03 0.08 0.08 0.03 0.02 0.22
## concavity 7 357 0.05 0.04 0.04 0.04 0.03 0.00 0.41
## concave.points 8 357 0.03 0.02 0.02 0.02 0.01 0.00 0.09
## range skew kurtosis se
## radius 10.87 -0.08 -0.05 0.09
## texture 24.10 0.97 1.16 0.21
## perimeter 70.81 -0.06 -0.05 0.62
## area 848.60 0.34 0.27 7.11
## smoothness 0.11 0.66 1.79 0.00
## compactness 0.20 1.20 2.21 0.00
## concavity 0.41 3.44 20.40 0.00
## concave.points 0.09 0.92 0.98 0.00
## ------------------------------------------------------------
## Diagnosis: M
## vars n mean sd median trimmed mad min max
## radius 1 212 17.46 3.20 17.33 17.32 3.36 10.95 28.11
## texture 2 212 21.60 3.78 21.46 21.43 3.25 10.38 39.28
## perimeter 3 212 115.37 21.85 114.20 114.19 23.17 71.90 188.50
## area 4 212 978.38 367.94 932.00 945.98 366.57 361.60 2501.00
## smoothness 5 212 0.10 0.01 0.10 0.10 0.01 0.07 0.14
## compactness 6 212 0.15 0.05 0.13 0.14 0.04 0.05 0.35
## concavity 7 212 0.16 0.08 0.15 0.15 0.07 0.02 0.43
## concave.points 8 212 0.09 0.03 0.09 0.09 0.03 0.02 0.20
## range skew kurtosis se
## radius 17.16 0.49 0.31 0.22
## texture 28.90 0.69 2.25 0.26
## perimeter 116.60 0.60 0.52 1.50
## area 2139.40 1.10 2.17 25.27
## smoothness 0.07 0.47 0.36 0.00
## compactness 0.30 0.82 0.77 0.00
## concavity 0.40 0.89 1.06 0.01
## concave.points 0.18 0.73 0.65 0.00
When the descriptive statistics computed separately for the M (malignant tumor) and B (benign tumor) labels are examined, the means of the variables for the M label are above the overall means. Since the first cluster is the one in which the variables are above average in the clustering results, it can be said that the first cluster corresponds to malignant tumors and the second cluster to benign tumors.
In order not to extend the report further, two variables represented by PC1 and two represented by PC2 were selected, and the relationship between each pair was plotted for each cluster. While drawing the graphs, the cluster names 1 and 2 were changed to M (Malignant Tumor) and B (Benign Tumor) according to the results of the label comparison.
final_data <- mutate(final_data, cluster = ifelse(cluster == 1,"M", "B"))
In the graph for M (Malignant Tumor), a positive relationship is observed starting from around Radius 8 and Area 250 and extending to the maximum values of both variables. For B (Benign Tumor), a positive relationship is observed that starts at 0 and ends at about 1000 for Area and 17 for Radius. From this graph, it can be inferred that the nucleus areas of malignant tumors span a wider range, while those of benign tumors do not grow much. The radius of benign tumors does not exceed 17.
For both cluster B (Benign Tumor) and cluster M (Malignant Tumor), the Fractal dimension and Smoothness variables have a wide range. This may be because separation only occurs along PC1. It can therefore be misleading to rely on the variables represented by PC2 (Fractal Dimension, Smoothness, Compactness, Symmetry) when distinguishing between benign and malignant tumors.
References
[1] Bryant, F. B., & Yarnold, P. R. (1995). Principal-components analysis and exploratory and confirmatory factor analysis.
[2] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
[3] Ben-Hur, Asa, and Isabelle Guyon. Detecting stable clusters using principal component analysis. Functional genomics. Humana press, 159-182, 2003.
[4] Ding, Chris, and Xiaofeng He. K-means clustering via principal component analysis. Proceedings of the twenty-first international conference on Machine learning. 2004.
[5] Kassambara, Alboukadel. Practical guide to cluster analysis in R: Unsupervised machine learning. Vol. 1. Sthda, 2017.
[6] Hopkins, J. W., & Gridgeman, N. T. (1955). Comparative sensitivity of pair and triad flavor intensity difference tests. Biometrics, 11(1), 63-68.
[7] Hartigan, John A., and Manchek A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28, 100-108, 1979.
[8] Kassambara, Alboukadel. Practical guide to cluster analysis in R: Unsupervised machine learning. Vol. 1. Sthda, 2017.
[9] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
[10] Steinley, D., & Brusco, M. J. (2011). Choosing the number of clusters in K-means clustering. Psychological methods, 16(3), 285.
[11] Halkidi, Maria, Yannis Batistakis, and Michalis Vazirgiannis. “On clustering validation techniques.” Journal of intelligent information systems 17 (2001): 107-145.
[12] Rousseeuw, Peter J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 1987, 20: 53-65.
[13] Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of intelligent information systems, 17, 107-145.
[14] Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
[15] Kaufman, L., & Rousseeuw, P. (1987). Clustering by means of medoids. Statistical Data Analysis Based on the L1-Norm and Related Methods, Y. Dodge Ed.
[16] Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis. John Wiley & Sons.
[17] Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301), 236-244.
[18] Roux, M. (2015). A comparative study of divisive hierarchical clustering algorithms. arXiv preprint arXiv:1506.08977.
[19] Kassambara, Alboukadel. Practical guide to cluster analysis in R: Unsupervised machine learning. Vol. 1. Sthda, 2017.
[20] Triayudi, A., & Fitri, I. (2018). Comparison of parameter-free agglomerative hierarchical clustering methods. ICIC Express Letters, 12(10), 973-980.
[21] Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86-97.
[22] Kassambara, Alboukadel. Practical guide to cluster analysis in R: Unsupervised machine learning. Vol. 1. Sthda, 2017.
[23] McNicholas, P. D. (2016). Model-based clustering. Journal of Classification, 33, 331-373.
[24] Kassambara, Alboukadel. Practical guide to cluster analysis in R: Unsupervised machine learning. Vol. 1. Sthda, 2017.
[25] Kriegel, H. P., Kröger, P., Sander, J., & Zimek, A. (2011). Density‐based clustering. Wiley interdisciplinary reviews: data mining and knowledge discovery, 1(3), 231-240.
[26] Bäcklund, H., Hedblom, A., & Neijman, N. (2011). A density-based spatial clustering of application with noise. Data Mining TNM033, 33, 11-30.